How good are detection proposals, really?
Current top performing Pascal VOC object detectors employ detection proposals
to guide the search for objects, thereby avoiding exhaustive sliding window
search across images. Despite the popularity of detection proposals, it is
unclear which trade-offs are made when using them during object detection. We
provide an in-depth analysis of ten object proposal methods along with four
baselines regarding ground truth annotation recall (on Pascal VOC 2007 and
ImageNet 2013), repeatability, and impact on DPM detector performance. Our
findings show common weaknesses of existing methods, and provide insights to
choose the most adequate method for different settings.
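To make the recall evaluation concrete, here is a minimal sketch of how ground truth annotation recall can be computed for a set of proposals: a ground-truth box counts as recalled if at least one proposal overlaps it above an intersection-over-union (IoU) threshold. The helper names and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def annotation_recall(ground_truth, proposals, thresh=0.5):
    """Fraction of ground-truth boxes matched by at least one proposal
    at IoU >= thresh."""
    if not ground_truth:
        return 1.0
    hits = sum(any(iou(gt, p) >= thresh for p in proposals)
               for gt in ground_truth)
    return hits / len(ground_truth)
```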
Learning non-maximum suppression
Object detectors have hugely profited from moving towards an end-to-end
learning paradigm: proposals, features, and the classifier becoming one neural
network improved results two-fold on general object detection. One
indispensable component is non-maximum suppression (NMS), a post-processing
algorithm responsible for merging all detections that belong to the same
object. The de facto standard NMS algorithm is still fully hand-crafted,
suspiciously simple, and -- being based on greedy clustering with a fixed
distance threshold -- forces a trade-off between recall and precision. We
propose a new network architecture designed to perform NMS, using only boxes
and their score. We report experiments for person detection on PETS and for
general object categories on the COCO dataset. Our approach shows promise,
providing improved localization and occlusion handling.
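For reference, the hand-crafted baseline the abstract argues against is greedy NMS with a fixed overlap threshold. The sketch below is a minimal illustration of that conventional procedure (not the proposed learned network); the 0.5 threshold and the inline IoU helper are assumptions.

```python
def greedy_nms(boxes, scores, overlap_thresh=0.5):
    """Conventional greedy NMS on boxes given as (x1, y1, x2, y2):
    repeatedly keep the highest-scoring box and suppress all remaining
    boxes that overlap it by more than a fixed threshold."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = [i for i in rest
                 if iou(boxes[best], boxes[i]) <= overlap_thresh]
    return keep  # indices of the retained detections
```

Raising the threshold recovers more overlapping true positives but admits duplicate detections, which is the recall/precision trade-off the abstract refers to.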
What makes for effective detection proposals?
Current top performing object detectors employ detection proposals to guide
the search for objects, thereby avoiding exhaustive sliding window search
across images. Despite the popularity and widespread use of detection
proposals, it is unclear which trade-offs are made when using them during
object detection. We provide an in-depth analysis of twelve proposal methods
along with four baselines regarding proposal repeatability, ground truth
annotation recall on PASCAL, ImageNet, and MS COCO, and their impact on DPM,
R-CNN, and Fast R-CNN detection performance. Our analysis shows that for object
detection improving proposal localisation accuracy is as important as improving
recall. We introduce a novel metric, the average recall (AR), which rewards
both high recall and good localisation and correlates surprisingly well with
detection performance. Our findings show common strengths and weaknesses of
existing methods, and provide insights and metrics for selecting and tuning
proposal methods.
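As a rough illustration of the proposed metric, average recall can be obtained by sweeping the recall measurement over a range of IoU thresholds and averaging. The sketch below reuses the annotation_recall helper from the first abstract's example; the specific threshold grid is an assumption rather than the paper's exact definition.

```python
import numpy as np

def average_recall(ground_truth, proposals,
                   thresholds=np.arange(0.5, 1.0, 0.05)):
    """Mean of proposal recall over several IoU thresholds, so that both
    covering many objects and localising them tightly are rewarded."""
    recalls = [annotation_recall(ground_truth, proposals, thresh=t)
               for t in thresholds]
    return float(np.mean(recalls))
```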
Taking a Deeper Look at Pedestrians
In this paper we study the use of convolutional neural networks (convnets)
for the task of pedestrian detection. Despite their recent diverse successes,
convnets historically underperform compared to other pedestrian detectors. We
deliberately avoid explicitly modelling the problem inside the network (e.g. parts
or occlusion modelling) and show that we can reach competitive performance
without bells and whistles. In a wide range of experiments we analyse small and
big convnets, their architectural choices, parameters, and the influence of
different training data, including pre-training on surrogate tasks.
We present the best convnet detectors on the Caltech and KITTI datasets. On
Caltech our convnets reach top performance both for the Caltech1x and
Caltech10x training setups. Using additional data at training time, our strongest
convnet model is competitive even with detectors that use additional data
(optical flow) at test time.
Person Recognition in Personal Photo Collections
Recognising persons in everyday photos presents major challenges (occluded
faces, different clothing, locations, etc.) for machine vision. We propose a
convnet-based person recognition system on which we provide an in-depth
analysis of informativeness of different body cues, impact of training data,
and the common failure modes of the system. In addition, we discuss the
limitations of existing benchmarks and propose more challenging ones. Our
method is simple and is built on open source and open data, yet it improves
state-of-the-art results on a large dataset of social media photos (PIPA).
Lucid Data Dreaming for Video Object Segmentation
Convolutional networks reach top quality in pixel-level video object
segmentation but require a large amount of training data (1k~100k) to deliver
such results. We propose a new training strategy which achieves
state-of-the-art results across three evaluation datasets while using 20x~1000x
less annotated data than competing methods. Our approach is suitable for both
single and multiple object segmentation. Instead of using large training sets
hoping to generalize across domains, we generate in-domain training data using
the provided annotation on the first frame of each video to synthesize ("lucid
dream") plausible future video frames. In-domain per-video training data allows
us to train high quality appearance- and motion-based models, as well as tune
the post-processing stage. This approach allows us to reach competitive results
even when training from only a single annotated frame, without ImageNet
pre-training. Our results indicate that using a larger training set is not
automatically better, and that for the video object segmentation task a smaller
training set that is closer to the target domain is more effective. This
changes the mindset regarding how many training samples and general
"objectness" knowledge are required for the video object segmentation task.Comment: Accepted in International Journal of Computer Vision (IJCV
Exploiting saliency for object segmentation from image level labels
There have been remarkable improvements in the semantic labelling task in
recent years. However, state-of-the-art methods rely on large-scale
pixel-level annotations. This paper studies the problem of training a
pixel-wise semantic labeller network from image-level annotations of the
present object classes. Recently, it has been shown that high quality seeds
indicating discriminative object regions can be obtained from image-level
labels. Without additional information, obtaining the full extent of the object
is an inherently ill-posed problem due to co-occurrences. We propose using a
saliency model as additional information and thereby exploit prior knowledge on
the object extent and image statistics. We show how to combine both information
sources in order to recover 80% of the fully supervised performance, which is
the new state of the art in weakly supervised training for pixel-wise semantic
labelling. The code is available at https://goo.gl/KygSeb.
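One plausible way to fuse the two cues is sketched below: confident seeds keep their class, clearly non-salient pixels become background, salient but unseeded pixels inherit a foreground seed class, and the remainder is ignored during training. The thresholds, the ignore-label convention, and the single-dominant-class simplification are all assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

IGNORE = 255  # pixels excluded from the training loss (assumed convention)

def build_pseudo_labels(seeds, saliency, fg_thresh=0.6, bg_thresh=0.2):
    """Fuse discriminative seeds (H x W array of class ids, 0 = background,
    IGNORE = unlabelled) with a class-agnostic saliency map (H x W floats
    in [0, 1]) into pseudo ground truth for a pixel-wise labeller."""
    labels = np.full(seeds.shape, IGNORE, dtype=np.uint8)

    # keep confident seed pixels as they are
    labels[seeds != IGNORE] = seeds[seeds != IGNORE]

    # clearly non-salient, unseeded pixels become background
    labels[(saliency < bg_thresh) & (seeds == IGNORE)] = 0

    # salient but unseeded pixels inherit the dominant foreground seed class
    fg = seeds[(seeds != IGNORE) & (seeds > 0)]
    if fg.size:
        classes, counts = np.unique(fg, return_counts=True)
        dominant = int(classes[np.argmax(counts)])
        labels[(saliency > fg_thresh) & (seeds == IGNORE)] = dominant

    return labels
```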
Design of an urban driverless ground vehicle
This paper presents the design and implementation of a driverless car for populated urban environments. We propose a system that explicitly maps static obstacles, detects and tracks moving obstacles, accounts for unobserved areas, produces a motion plan with safety guarantees, and executes it. All of this was implemented and integrated on a single computer maneuvering an electric vehicle in real time through a previously unvisited area with moving obstacles. An overview of the algorithms and some experimental results are presented.